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Abstract 

Although power laws of the Zipf type have been used by many workers to fit rank 
distributions in different fields like in economy, geophysics, genetics, soft-matter, 
networks etc., these fits usually fail at the tails. Some distributions have been pro- 
posed to solve the problem, but unfortunately they do not fit at the same time 
both ending tails. We show that many different data in rank laws, like in granu- 
lar materials, codons, author impact in scientific journal, etc. are very well fitted 
by a beta-like function. Then we propose that such universality is due to the fact 
that a system made from many subsystems or choices, imply stretched exponential 
frequency-rank functions which qualitatively and quantitatively can be fitted with 
the proposed beta-like function distribution in the limit of many random variables. 
We prove this by transforming the problem into an algebraic one: finding the rank 
of successive products of a given set of numbers. 
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1 Introduction 



Both natural language texts and coding DNA sequences present power laws 
in the observed frequency of a word as a function of its rank (r), where the 
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rank is just the ordinal position of a word if all words are ordered according 
to their decreasing frequency. Usually, the most frequent word has rank 1, the 
next most frequent rank 2 and so on. This power law behavior of the ranking is 
known as the Zipf law [1], and it is very common in physics, biology, geography, 
economics, linguistics, etc. [1]. In physics one can cite the rank distribution of 
stick-slip events in sheared granular media [2], earthquakes [2], radionuclides 
half-life time and nuclides mass number [3]. Many complex systems share as 
well the same phenomenology, as happens in networks [4] , biological clocks [5] 
and metabolic networks [6] . Zipf discovered his rank law by analyzing manually 
the frequencies of 29,899 different words types in the novel "Ulysses" by James 
Joyce, but when a larger set of words is considered, a deviation from a power 
law is observed for larger ranks [7]. A similar behavior is found in coding 
genetic sequences. Deviations from the Zipf law are also found in the tails 
ranking of many physical systems [8]. In fact, is clear that one should expect 
a different behavior at the tails, since finite size effects should be present and 
the power law must be "stopped" at a certain region. In spite of this, many 
workers just ignore the tail effects by fitting the data in a restricted range, 
or they proceed in a very questionable way by fitting all the data with a 
power law. Others have fitted sets of data in nature and in economy with 
stretched exponentials [8] and log- normal distributions [9]. The problem with 
the previous expressions is that they do not fit the data at both ending tails, 
where different kinds of processes are set in once a crossover region is reached. 
Such crossovers are due to finite size effects, in which different mechanisms 
are set in when certain big and small scales are reached. This leads to the 
idea of using multiscaling physical modelling to understand such features. 
Maybe the best example of the previous situation occurs in turbulence, where 
Kolmogorov's power law is observed only in the inertial regimen [10] [11]. In 
one tail (small length scales) energy dissipation plays the main role, while 
energy injection dominates at big scales [10] [11]. For each of these limits, the 
scaling behavior is different [12] [13]. One can conjecture that similar ideas 
are behind many other complex physical systems, since we report that many 
rank laws are extremely well parametrized, outperforming many other rank- 
order models, with a two exponent beta function-like formula with parameters 
{a,b}, 

m=x { r a > (!) 

where a and b are fitted from the data, r is the rank and R is the maximal r. If 
f(r) is normalized to 1, then K = 1/X^Li (R — r + l) b /r a . For R 1, K can 
be transformed into an integral that yields K Y(b — a + 2)/T(l — a)r(l + 6). 
We will show that f(r) is related with a kind of central limit theorem, in which 
a and b seem to be parameters related with the onset of different mechanisms. 
Our work is in the same spirit of Moyano et. al. [14], who have commented 
that the rather ubiquitous presence of the Tsallis g-distributions is maybe 
due to a g-generalized central limit theorem for a class of non independent, 
correlated, product of probability distributions [15]. The outline of this pa- 
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Fig. 1. Population ranking of four representative municipalites from Mexico and 
Spain. The solid lines are the fits obtained from Eq. (1) The inset presents the 
corresponding values of a and b used in the fits. 

per is the following: in section II we present some representative examples of 
the phenomenology that we have observed. In section III we show how this 
phenomenology can be studied as a problem of hierarchies in the product 
of random variables, and then transformed into a related algebraic problem: 
what is the rank of a set of numbers produced by the iterative product of an 
initial finite set of numbers. In section IV, we solve the proposed problem, and 
finally, in section V we give the conclusions of this work. 



2 Phenomenology of rank laws and the beta-like function 

As starting point, we will provide some representative results of the wide 
phenomenology found in the tails of rank laws. We start with an example 
from geography. Fig. 1 shows the population ranking of four representative 
municipalities in Mexico and Spain in a semilog plot. The corresponding fits 
using Eq. (1) are given by solid lines. The agreement is excellent, with a 
correlation coefficient R bigger than 0.98 for all fits. The values of a and b for 
each fit are shown in the inset of the plot. We have verified that similar good 
results are obtained for the population of countries and states. 

Figure 2 shows the impact factor against the rank of scientific journals, taken 
from a recent study [16], compared with the fits given by Eq. (1). Again, all 
the fits are excellent, with correlation coefficients above 0.98. 

Similar excellent fitting results are obtained for codon usage in genomes, as 
shown in Fig. 3, where we plot the logarithm of the frequency of codons 
(normalized to 1000) as a function of the rank for different representative 
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Fig. 2. Impact factor as a function of the rank for physics, computer science and 
agroscience. Fits using the beta-like function are shown as solid lines. Inset: values 
of a and b. 



Fig. 3. Frequency of codons (normalized to 1000) as a function of the rank for the 
genome of four different species, with their correponding fits shown as solid lines. 
Inset: values of a a and b used for the fits in the beta-like distribution. 

organisms, taken from a genome database [17]. For all the organisms, the 
resulting correlation parameters are bigger than 0.97. 

Now we turn our attention to physics. In Fig. 4 we plot the rank-ordered 
distribution of stick-slip events in a slowly sheared granular media taken from 
Ref. [2], fitted using Eq. (1). Although a modified power law was proposed in 
Ref. [2] to explain the results, the present fit also gives a better correlation 
coefficient. 

Here we presented four examples, but Eq. (1) can be used with excellent results 
in order to correct the Gutenberg-Ritcher law in earthquakes ranking, Benard 
convection cells and in many different fields, like arquitecture, music or roads 
[18]. 
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Fig. 4. Rank-ordered distribution of stick-slip events in a slowly sheared granular 
media. Circles are data taken from Ref. [2], and the solid line is a fit using Eq. (1), 
with a = 1.08 and b = 0.40 

3 Hierarchy in a multiplicative stochastic processes 



The previous section leads to the conclusion that both ending tails of the 
ranking present some degree of universality, and Eq. (1) seems to be an excel- 
lent fitting function due to the fact that it gives the right shape of the curve 
and thus very good correlation coefficients. Also, it is simple and can be re- 
duced to a pure power law by using an appropriate choice of a and b. As the 
{a, b} distributions is indeed ubiquitous, one can try to associate it to some 
generic mechanism, as happens in the central limit theorem or in the product 
of correlated probability distributions [14]. 

In the dynamics of population, scientific journal impact factor, codon usage 
and stick-slip events, there are many important issues that determine the be- 
havior. In the case of the impact factor we can cite for example the ability 
to select a good problem for investigation, the gift for writing clear papers, 
etc. Similar comments would be valid for the dynamics of granular media, 
as well as in economy, linguistics, genetics, etc. All of the previous systems 
share a common feature: their complex nature, i.e., they are build from many 
subsystems or path choices that produce a final result. One can try to model 
such complexity as follows. Consider a system made from N identical subsys- 
tems, where each can have s different states or choices with probability pj, 
and j = 1, s. When N such subsystems are put together, the state space 
consists of all s N possible sequences of length N. If we do not care about the 
order of the choices or states in the string, there are just (N+s — l)\/s\(N — 1)! 
different combinations. For example, if a system is made from N = 2 subsys- 
tems, where each has two states or choices, say 1 or 0, the possible global 
states are (0,0), (1,0), (0,1) and (1,1), while there are only three combina- 
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tions: (0, 0), (1, 1) and (1, 0), the last one has multiplicity 2. Each combination 
has a certain probability that we call reduced probabilities x^{ni,n 2 , ■■■,n s ). 
The multiplicity of each different state is given by the multinomial coefficient 
A^!/(ni!n 2 !n 3 !...n s !), where rij is the number of subsystems in the j-esim state. 
The probability of a global state of the whole system is, 

N\ 

P N (n 1 ,n 2 , n s ) = — ; — : — ; :X N (n 1 ,n 2 , n s ), (2) 

ni!n 2 !n3!...n s ! 

with ni + i%2 + n,3 + ...n s = N. However, we are interested in the rank of the 
observed different values of the macrostates, not in their distribution of prob- 
ability. To tackle this problem, we notice that each value xn{tii, ri2, ■ n s ) 
corresponds to a different macrostate of the system. In our example, the 
states (0,1) and (1,0) produce the same global macrostate. These two in- 
ternal states lead to one global state that has the same characteristics. If 
one assume that a certain characteristic (X) of a process or object is a func- 
tion of ni, n<2, n s , then each value of X(ni, n 2 , n s ) can be mapped to 
xn(ui, n 2 , n s ) and X(ni, n 2 , n s ) = X(xn(tii, n 2 , n s )). From the pre- 
vious considerations, is clear that any rank hierarchy of xn(tii, ri2, ■■■,n s ) will 
be inherited to X(jii, n 2 , ...,n s ). Thus, many different rank features of a system 
are reduced to study the hierarchy present in x n (ni,ri2, ...,n s ). 

For doing such study, there are two cases. In the first, the subsystems are 
independent, as in a Bernoulli process, 

x N (ni,n2,...,n s )=p^pT P T-P:% (3) 

and the other is the general case of interacting subsystems, in which the ad- 
dition of a new subsystem leads to a functional relationship of the type, 

x N+ i(ni, n 2 , n s ) = f(x N (n 1 ,n 2 , -,n s )). (4) 

In the next section we will consider only the case of independent subsystems, 
in which no extra information is needed in order to model the correlation in 
the system. This allows to produce the beta-like function in a simple form. 



4 The rank hierarchy as an algebraic problem 



For independent subsystems, an inspection of Eq. (3) shows that the rank 
structure can be reduced to the following algebraic problem. Take s numbers 
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Fig. 5. Succesive multiplication of three numbers p\ = 0.5202, p 2 = 0.3125, 
P3 = 0.1673 as a function of the rank (bold solid line), and a fitting using Eq. 
(1), with a = 9.36, b = 14.53. 

Pi, p 2 , ...,p s at random (normalization can be imposed at the end of the pro- 
cess), labeled in such a way that p\ > p 2 > ■■■ > p s , and multiply once each 
number by all the numbers in the set. With these resulting numbers, repeat the 
process N times to obtain a set of numbers that have the form P^pTpT ■■■VT ■> 
where n\+n 2 + ...+n s = N. If the resulting numbers are arranged in decreasing 
magnitude, we can assign a rank (r) to each one according to its order in the 
hierarchy. The rank r = 1 is assigned to p± , while the lowest rank r = R corre- 
sponds to p^ . For example, chose at random three numbers pi, p 2 and p 3 and 
form all the possible products: pf, P1P2, P1P3, P 2 , P2P3, pi- in Fig. 5, we present 
a plot of \ogx N (rii,n 2 ,n 3 ) as a function of r for N = 30 and pi = 0.5202, 
p 2 = 0.3125 and ^3 = 0.1673. Fig. 5 shows that the resulting ranks are well 
fitted by the same two parameter beta-like function, with a = 9.36 ± 0.2 and 
b = 10.52 ± 0.2, with a correlation coefficient of 0.972. The message from this 
numerical experiment is simple: if this product is seen as a multiplicative pro- 
cess where each number is the probability of making a certain choice or state 
in a process, then each possible result has a well determined hierarchy. 

The task that remains is how to calculate x N (rii,n 2 , ...,n s ) in terms of the 
rank. The problem is more easily solved using the logarithm of xjv(^i, n 2 , ...,n s ), 

logx N (ni,n 2 ,...,n s ) = nilogpi + n 2 \ogp 2 + ... +n s \ogp s . (5) 

Each set of values (ni, n 2 , n s ) is a point with integer coordinates in a 
s— dimensional space. Since n\ + n 2 + ... + n s = N, all the points are in a 
subspace of dimension s — 1. The problem of the rank is reduced to find a 
path between the maximal rank point (with coordinates (N, 0, 0, 0)) to the 
minimum (0, 0, 0, N) in such a way that logxjv(^i, n 2 , n s ) decreases in 
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Fig. 6. Path of decreasing rank in the ni,ri2 and n 3 space, for N = 15 and three 
random numbers p x = 0.5202, p 2 = 0.3125, p 3 = 0.1673. 

each step. For s = 2, the solution is easy to find. Using that rt\ + n 2 = N, 



with A = |ln(p 2 /pi) I • Eq. (7) shows that the numbers decrease in an exponen- 
tial way as a function of the rank. 

The case s = 3 can be easily visualized in Fig. 6, where the points in the 
integer lattice defined by Eq. (5) are shown as circles. 

A path between points of decreasing logxjv(ni, n 2 , n s ) is indicated as a line 
that joins the lattice points in Fig. 6, for a given set of numbers pi,p 2 and 
Pz- Figure 7 shows how the values of rii,n 2 and n 3 vary as a function of the 
range. A very complicated oscillatory pattern is seen , although a well defined 
envelope is also observed. This envelope is in fact the key to solve the problem, 
since it is the responsible of the ranking behavior. Notice also that all paths 
always start at (N, 0, 0) and finish at (0, 0, N), since logpi > logp2 > logp3- 

In general, since the index rij is a function of the rank r, we can write that 
rij = rij(r) where r is just the number of steps used to go from the point 
(N, 0, 0) to a certain (ni,n 2 , n 3 , n s ). It follows that, 



x N (n u n 2 ) = x N (n 2 ) = pf n2 p 2 2 , 



(6) 



from where it follows that the range is given by r = n 2 + 1. Then, 





log Xjy (r) = ni(r) logpi + n 2 (r) \ogp 2 + ... + n s (r) logp s 
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Fig. 7. Values of n\ (thin solid line), 772 (grey line) and 773 (solid bold line) as a 
function of the rank, for N = 20 and Pl = 0.5202, p 2 = 0.3125, p 3 = 0.1673. 



Fig. 8. Path of decreasing ranks in the 772 and 773 plane for p\ ~ p 2 3> p 3 , where the n\ 
coordinate was eliminated using that 771 + 772 + 773 = N. The dotted line corresponds 
to all the n,2MAx(r), which defines the envelope of the ranking sequence. 

The task is reduced to find the functions rij(r) for a given set {pj}- Consider 
again the case of an initial set of three numbers, s = 3. Using that ni+ri2+nz = 
N, log£jv(?") can be written as, 

logxjv(r) = N\ogp 1 + n 2 (r) log5 2 i + n 3 (r) log5 31 . (9) 

with 5 2 i — P2/P1 and 5 3 \ = p 3 /pi- The solution for any set pi,p 2 ,P3 is 
complicated, because some paths are not periodic. However, one can work out 
first the cases p\ ~ p 2 ^> p 3 and p\ p 2 ~ p 3 that give insights about how to 
treat others. 
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2i ^ ^3i- The corresponding 



Let us first consider the limit p± ~ p 2 ^> p 3 , and S 2 
path is easy to find because it is similar to an odometer with an increased 
range after each turn, as seen in Fig. 8, due to the hierarchy 1 > 5 2 i > S21 > 



^3i > ^21^31 > £31 > ••• > S 31 . For example, when N = 2 this leads to the 



following table that contains the number xw(r) as a function of the rank, and 
the corresponding path given by ri2 and n^, 



x N (r) 


n 2 


n 3 


r 


n 2 M{r) 


p\ 








1 




pIs 21 


1 





2 




plSli 


2 





3 


2 


Plhi 





1 


4 




pjs 21 s 31 


1 


1 


5 


1 


Pisl, 





2 


6 






The sequence of the path goes as follows, first n 2 (r) is increased one by one as 
113 remains constant, until it reaches a maximal value called n2MAx{r) which 
in fact determines the envelope of the ranking sequence and thus the basic 
shape of the curve Xn(j') (the envelope that contains n 2 MAx(r) is shown in 
Fig. 8 as a dotted line). Once ri2(r) increases from zero to n,2MAx(r), a new 
cycle begins with n 2 (r) = and n^ij- + 1) = n^r) + 1. As a result, the number 
of steps r to reach U2max (r) is given by, 

n 2M ^W n 2MAX (r) (n 2MAX (r) + 1) nm 

R-r ~ n 2M Ax(r)+ }^ 3=n 2 MAx{r) + ,(10) 

3=1 1 

where R is the maximal rank. Then, 



n 2 MAx(r) ~ N i 1 ^ ~r) 



x 1/2 



The corresponding value of n 3 (r) can be obtained from the condition n 2 +n 3 < 
N. Finally, the number as a function of the rank is given by, 



x N (r) 




N 



(12) 



Figure 9 shows the excellent agreement between Eq. (12) and the curve ob- 
tained for Pl = 0.5250, p 2 = 0.4250, p 3 = 0.000047. Furthermore, Eq. (12) can 
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Fig. 9. Numerical results for the ranking of the succesive product of three numbers 
such that pi ~ p2 2> ps- The smooth line is the prediction using Eq. (12). 

be written as an stretched exponential as follows, 



with D = N |log(p 2 /p3)| and R is the maximal value of r. Notice in Fig. 9 how 
this formula works better as the rank approaches R. 

The case pi ^> P2 ~ P3 can be tackled in a similar way. The result is, 



with E = N (log(pi/p3) — l°g(P2/p3)), and as shown in Fig. 10, the agreement 
is also good, specially for low values of r. 

Now consider the general case in which p\jp2 and p^ have the same order of 
magnitude, as in Fig. 5, where two tails appears, one for small r and the other 
at r near R. The tail at low r is produced basically by the hierarchy in the 
biggest probabilities, i.e., by numbers where ri\ ~ N. In a similar way, the tail 
for r near R is produced by the lowest probability hierarchy, ~ N. These 
dominant factors are due to large statistical deviations and are the origin of 
the long tails in the otherwise power law observed in the ranks. The main 
effect in these tails when p p 2 ~ p% is that the sequence of ordering is 
not uniform as can be observed in Fig. 5, for which a very complicate path 
appears. As a result, Eq. (10) changes with the apparition of new subcycles in 
the rank path. These changes are the result of the increasing number of cycles 
in the odometer that we have discussed, as is also clear from the change in the 
exponents that are transformed from 1 to 1/2 as s goes from s = 2 to s = 3. 




(13) 




(14) 
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Fig. 10. Ranking of the succesive product of three numbers such that p\ 3> P2 ~ P3, 
for pi = 0.99999, p 2 = 6.2 x 10~ 6 , p 3 = 3.8 x 10~ 6 . The dashed line is the prediction 
made from Eq. (14), compared with the numerical result for N = 100 iterations 
(solid line). 

Eq. (13) is thus transformed into a generalized expression, 



x N (r) w exp 



(15) 



in which (3 is a yet unknown exponent, always less than one. In a similar way, 
Eq. (14) should be replaced by, 



xn\t) ~ p x exp 



(16) 



with a < 1. These generic exponents for the tails also appear for s > 3 since 
from the polynomial equivalent to Eq. (10), one gets a ~ f3 ~ l/(s — 1). A 
simple procedure to combine the tails represented by Eq. (15) and Eq. (16) is 
obtained by making the observation that for a given tail, only one stretched 
exponential produces a curved tails in a semi- log plot, while the other tends 
toward a constant, i.e., if we consider the derivative of Eq. (15), 



'd\nx N (r)\_ PD( 1 _r_\f>-i 



\ dr J R V R 

is clear that x' N (r) is nearly a constant if r <^ 1, corresponding to the limit 
in which Eq. (16) has greater curvature. Analyzing the limit r — > R gives a 
similar result, 

'<ilnxAr(r)\ aE ( r \ tt ^ 

. (18) 



From these considerations, a simple way to produce a function with the re- 
quired dependences when r — > R and r — * 1 is the following, 



12 



Fig. 11. A plot of Eq. (19) using Ci = 2, D = 1, E = 1, (3 = 1 and a = 1. 



x^(r) m C\ exp 



D 1 



r — 1 



exp 



(19) 



where C\ is a constant. A plot of the previous expression is presented in Fig. 
11, showing the basic shape of the studied beta function. 

Finally, Eq. (19) can be simplified when many states are present since for 
s > 1, a « j8 « l/(s - 1) and thus a — > and (3 — > 0. Then, by using the 
observation about the derivatives that appears in Eq. (17) and Eq. (18), one 
can approximate the derivatives like in Eq. (17) as follows, 



log xjv(r) 
dr 



(3D 



r \ 0-i 

r) 



(3D 



r 
R 



(20) 



A similar thing can be done in the tail r — > 1, for which a can be neglected 
with respect to one in Eq. (18). Combining both tails in a sole expression we 
get, 



'dlogx N (r) 
dr 



(3D 



r 
R 



-i 



aE /r 



By integrating the previous equation, we finally obtain the beta-like function 
given by Eq. (1), where the exponents a and b are given by, 



a = aE and b = (3D 



(21) 



Thus, the beta-like function is obtained when we have a large number of states 
in the system. Notice how the parameters a and b are determined mainly by 
the behavior in the tails. 
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5 Conclusions 



In conclusion, we found a simple formula that allows to fit many different 
rank phenomena. This formula shows that there is a certain universality at 
the tails, explained by considering the ranking of a multiplicative process. We 
have shown that such problem is equivalent to an algebraic problem: find the 
rank of the successive product of numbers. A task that remains is to how to 
get the coefficients a and b from physical principles, using for example master 
equations and the concept of multiscaling modelling. A key observation for 
such study is that for expansion-modification algorithms in DNA models, a > b 
if the expansion probability of the genetic code is bigger the than mutation rate 
[21]. Thus, a and b represent the relative influence of two general mechanisms, 
where each of them dominate at a given tail. According to some preliminary 
results, a seems to be related with a certain funnel type of energy landscape, 
as in protein folding, which leads to a certain deterministic sequence, while b 
is associated with a many valley landscape, as seen in spin glasses. This last 
opposite effect provides much more variability in the sequence of results. Such 
correlation is consistent with associating b to the stochastic component of the 
dynamics and a with the most deterministic features [21]. In future work, we 
will elucidate with more detail such mechanisms. 
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